Spot instances and recovering from shutdown
Requesting spot (or preemptible) instances reduces compute cost while still giving access to powerful instances, but the cloud provider can reclaim the instance at any moment and disrupt the workload.
Requesting a spot instance with dataplane
On AWS EKS or Azure AKS
A spot instance has to be requested from the manifest:
...
spec:
  ...
  types:
    Worker:
      ...
      resources:
        ...
        extraSelectors:
          karpenter.sh/capacity-type: spot
By default, AIchor experiments request karpenter.sh/capacity-type: on-demand.
On GCP GKE
...
spec:
  ...
  types:
    Worker:
      ...
      resources:
        ...
        extraSelectors:
          cloud.google.com/gke-spot: "true"
          cloud.google.com/gke-provisioning: spot
        # on GKE, a toleration also has to be passed
        extraTolerations:
          - key: "cloud.google.com/gke-spot"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
Recovering from eviction
Recovering from eviction is supported by two AIchor operators: jobset and kuberay.
You can find some demo projects using these two operators here and here. Experiments using any other operator fail when an eviction happens.
spec:
  operator: jobset # or kuberay
  restartPolicy:
    backoffLimit: 5
In the snippet above, spec.restartPolicy.backoffLimit is the number of allowed restarts: this experiment can handle 5 failures (including evictions) before being marked as failed.
Side note for the other operators
On the other operators (jax, pytorch, ...), spec.restartPolicy.backoffLimit only covers software failures, that is, when spec.command exits with a non-zero status. In that case the command is re-executed in the same container, on the same node.
Checkpointing
During training, the most reliable way to ensure progress is not lost is to periodically save checkpoints to an external storage backend (like AIchor S3 buckets). When the experiment restarts, the software should then automatically load the latest checkpoint and resume training from it. This keeps training reliable while using spot instances.
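The save-and-resume pattern described above can be sketched as follows. This is a minimal illustration and not AIchor's own API: the function names (save_checkpoint, load_latest_checkpoint, train), the JSON checkpoint layout, and the evict_at parameter used to simulate an eviction are all hypothetical, and a local directory stands in for the external storage backend. A real experiment would write to its bucket and use the checkpoint format of its training framework.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write the checkpoint for `step` atomically, so an eviction
    in the middle of a write cannot leave a corrupt file behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem


def load_latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint,
    or a fresh starting point if no checkpoint exists yet."""
    if not os.path.isdir(ckpt_dir):
        return 0, {"total": 0}
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("step_") and n.endswith(".json")
    )
    if not names:
        return 0, {"total": 0}
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, checkpoint_every=10, evict_at=None):
    """Toy training loop that always starts from the latest checkpoint."""
    step, state = load_latest_checkpoint(ckpt_dir)
    while step < total_steps:
        step += 1
        state["total"] += step  # stand-in for one real training step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
        if evict_at is not None and step == evict_at:
            # Hypothetical hook that simulates the node being reclaimed.
            raise RuntimeError("simulated eviction")
    return step, state
```

Because train always begins by calling load_latest_checkpoint, the same entry point works for the first launch and for every restart; an eviction only costs the steps run since the last saved checkpoint, so checkpoint_every trades checkpoint overhead against lost work.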